Processing method including pre-issue load-hit-store (LHS) hazard prediction to reduce rejection of load instructions

ABSTRACT

A processing method supporting out-of-order execution (OOE) includes load-hit-store (LHS) hazard prediction at the instruction dispatch phase, reducing load instruction rejections and queue flushes at the issue phase. The instruction dispatch unit (IDU) detects likely LHS hazards by generating entries for pending stores in an LHS detection table. The entries in the table contain an address field (generally the immediate field) of the store instruction and the register number of the store. The IDU compares the address field and register number for each load with entries in the table to determine if a likely LHS hazard exists. If an LHS hazard is detected, the load is dispatched to the issue queue of the load-store unit (LSU) with a tag corresponding to the matching store instruction, causing the LSU to issue the load only after the corresponding store has been issued for execution.

The present Application is a Continuation of U.S. patent application Ser. No. 14/522,811, filed on Oct. 24, 2014 and claims priority thereto under 35 U.S.C. §120. The disclosure of the above-referenced parent U.S. Patent Application is incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is related to processing systems and processors, and more specifically to techniques for predicting load-hit-store hazards at dispatch times to reduce rejection of dispatched load instructions.

2. Description of Related Art

In pipelined processors supporting out-of-order execution (OOE), overlaps between store and load instructions causing load-hit-store hazards represent a serious bottleneck in the data flow between the load store unit (LSU) and the instruction dispatch unit (IDU). In particular, in a typical pipelined processor, when a load-hit-store hazard is detected by the LSU, the load instruction that is dependent on the result of the store instruction is rejected, generally several times, and is reissued, and all newer instructions following the load instruction are flushed. The above-described reject and reissue operation not only consumes resources of the load-store data path(s) within the processor, but can also consume issue queue space in the load-store execution path(s) by filling the load-store issue queue with rejected load instructions that must be reissued. When such an LHS hazard occurs in a program loop, the reject and reissue operation can lead to a dramatic reduction in system performance.

In some systems, the reissued load instruction entries are tagged with dependency flags, so that subsequent reissues will only occur after the store operation on which the load instruction depends, preventing recurrence of the reissue operations. However, rejection of the first issue of the load instruction and the consequent flushing of newer instructions still represents a significant performance penalty in OOE processors.

It would therefore be desirable to provide a method for managing load-store operations with reduced rejection and reissue of operations, in particular load rejections due to load-hit-store hazards.

BRIEF SUMMARY OF THE INVENTION

The invention is embodied in a method that reduces rejection of load instructions by predicting likely load-hit-store hazards. The method is a method of operation of a processor core.

The processor core supports out-of-order execution and detects likely load-hit-store hazards. When an instruction dispatch unit decodes a fetched instruction, if the instruction is a store instruction, address information is stored in a load-hit-store detection table. The address information is generally the base registers used to generate the effective address of the store operation in register-based addressing and/or the immediate field of the instruction for immediate addressing. When a subsequent load instruction is encountered, the instruction dispatch unit checks the load-hit-store detection table to determine whether or not an entry in the table has matching address information. If a matching entry exists in the table, the instruction dispatch unit forwards the load instruction with a tag corresponding to the entry, so that the load-store unit will execute the load instruction after the corresponding store has been executed. If no matching entry exists in the table, the load instruction is issued untagged.

The foregoing and other objectives, features, and advantages of the invention will be apparent from the following, more particular, description of the preferred embodiment of the invention, as illustrated in the accompanying drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives, and advantages thereof, will best be understood by reference to the following detailed description of the invention when read in conjunction with the accompanying Figures, wherein like reference numerals indicate like components, and:

FIG. 1 is a block diagram illustrating a processing system in accordance with an embodiment of the present invention.

FIG. 2 is a block diagram illustrating details of a processor core 20 in accordance with an embodiment of the present invention.

FIG. 3 is a block diagram illustrating details within processor core 20 of FIG. 2 in accordance with an embodiment of the present invention.

FIG. 4 is a table depicting entries within LHS detection table 41 of processor core 20 in accordance with an embodiment of the present invention.

FIG. 5 is a flowchart depicting a method of dispatching load/store instructions in accordance with an embodiment of the present invention.

FIG. 6 is a flowchart depicting a method of issuing load/store instructions in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to processors and processing systems in which rejects of load instructions due to load-hit-store (LHS) hazards are reduced by predicting the occurrence of such hazards using an LHS prediction table to track dispatched stores that may or may not have been issued/executed. Load instructions are examined at dispatch time to determine whether or not a pending store exists that has not been committed for a cache write or that has otherwise been flushed from the load-store execution path. If an LHS hazard is detected, the load instruction is dispatched with an ITAG matching the ITAG of the store instruction corresponding to the entry in the LHS prediction table, so that the load-store unit will issue the load instruction dependent on the store result, i.e., will retain the load instruction in its issue queue until the store instruction is committed or flushed, preventing rejections of load instructions due to identification of LHS hazards during issue of the load instructions.

Referring now to FIG. 1, a processing system in accordance with an embodiment of the present invention is shown. The depicted processing system includes a number of processors 10A-10D, each in conformity with an embodiment of the present invention. The depicted multi-processing system is illustrative, and a processing system in accordance with other embodiments of the present invention includes uni-processor systems having simultaneous multi-threading (SMT) cores. Processors 10A-10D are identical in structure and include cores 20A-20B and a local storage 12, which may be a cache level, or a level of internal system memory. Processors 10A-10B are coupled to a main system memory 14 and a storage subsystem 16, which includes non-removable drives and optical drives, for reading media such as a CD-ROM 17. The illustrated processing system also includes input/output (I/O) interfaces and devices 18 such as mice and keyboards for receiving user input and graphical displays for displaying information. While the system of FIG. 1 is used to provide an illustration of a system in which the processor architecture of the present invention is implemented, it is understood that the depicted architecture is not limiting and is intended to provide an example of a suitable computer system in which the techniques of the present invention are applied.

Referring now to FIG. 2, details of processor cores 20A-20B of FIG. 1 are illustrated in depicted processor core 20. Processor core 20 includes an instruction fetch unit (IFU) 22 that fetches one or more instruction streams from cache or system memory and presents the instruction stream(s) to an instruction decode unit 24. An instruction dispatch unit (IDU) 26 dispatches the decoded instructions to a number of internal processor pipelines. The processor pipelines each include one of issue queues 27A-27D and an execution unit provided by branch execution unit (BXU) 28, condition result unit (CRU) 29, load-store unit (LSU) 30 or floating point units (FPUs) 31A-31B. Registers such as a counter register (CTR) 23A, a condition register (CR) 23B, general-purpose registers (GPR) 23D, and floating-point result registers (FPR) 23C provide locations for results of operations performed by the corresponding execution unit(s). A global completion table (GCT) 21 provides an indication of pending operations that is marked as completed when the results of an instruction are transferred to the corresponding one of result registers 23A-23D. In embodiments of the present invention, LHS prediction logic 40 within IDU 26 manages an LHS detection table 41 that contains entries for all pending store operations, e.g., all store operations that have not reached the point of irrevocable execution. IDU 26 also manages register mapping via a register mapper 25 that allocates storage in the various register sets so that concurrent execution of program code can be supported by the various pipelines. LSU 30 is coupled to a store queue (STQ) 42 and a load queue (LDQ) 43, in which pending store and load operations are respectively queued for storage within a data cache 44 that provides for loading and storing of data values in memory that are needed or modified by the pipelines in core 20. Data cache 44 is coupled to one or more translation look-aside buffers (TLB) 45 that map real or virtual addresses in data cache 44 to addresses in an external memory space.

Referring now to FIG. 3, a block diagram illustrating details of IDU 26 within processor core 20 of FIG. 2 is shown. LHS prediction logic 40 provides tracking of pending store operations by generating entries for each store instruction decoded by instruction decode unit 24 in LHS detection table 41. When store instructions are received by IDU 26, address information associated with the store instruction, which in the particular embodiment is the base registers and/or the immediate value used in calculating the effective address (EA) of the store operation, is inserted in LHS detection table 41. The entry is also populated with an instruction tag (ITAG) identifying the particular store instruction, so that the entry in LHS detection table 41 can be invalidated when the particular store instruction completes, along with other information such as the thread identifier for the instruction, the valid bit and the store instruction type, which is used to determine which field(s) to compare for address matching. FIG. 4 shows an exemplary LHS detection table 41 containing two valid entries and one entry that has been retired due to completion/commit of the store instruction to data cache 44 or invalidated due to a flush. When IDU 26 receives a load instruction, LHS prediction logic 40 compares the address information (e.g., immediate field and/or base registers, depending on the type of addressing) of the load instruction with each entry in LHS detection table 41, which may be facilitated by implementing LHS detection table 41 with a content-addressable memory (CAM) that produces the ITAG of the LHS detection table entry given the address information, thread identifier and store instruction type for valid entries. LHS detection table 41 may alternatively be organized as a first-in-first-out (FIFO) queue. The load instruction is then dispatched to issue queue 27D with the ITAG of the entry, in order to cause LSU 30 to retain the load instruction in issue queue 27D until the store instruction causing the LHS hazard in conjunction with the load instruction has issued, completed, or has been otherwise irrevocably committed or flushed. In one embodiment of the invention, the lookup in LHS detection table 41 locates the most recent entry matching the look-up information, so that if multiple matching entries exist in LHS detection table 41, the load instruction will be queued until the last store instruction causing an LHS hazard has been issued/completed/committed/flushed. In another embodiment, before an entry is generated in LHS detection table 41, a look-up is performed to determine if a matching entry exists, and if so, the existing entry is invalidated or updated with a new ITAG. If LHS detection table 41 is full, the oldest entry is overwritten.
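By way of illustration only, the behavior of LHS detection table 41 described above may be sketched in software as follows. The sketch assumes a FIFO organization with most-recent-match lookup. The names `LhsEntry` and `LhsDetectionTable`, the capacity of four entries, and the `"reg"`/`"imm"` encoding of the store instruction type are hypothetical and are not taken from the embodiment.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LhsEntry:
    """One table entry; field names are illustrative assumptions."""
    itag: int                      # instruction tag of the pending store
    thread_id: int
    store_type: str                # "reg" or "imm": selects the compared field
    base_regs: Tuple[int, ...]     # base register numbers (register-based)
    immediate: Optional[int]       # immediate field (immediate addressing)
    valid: bool = True

    def key(self):
        addr = self.base_regs if self.store_type == "reg" else self.immediate
        return (self.thread_id, self.store_type, addr)


class LhsDetectionTable:
    def __init__(self, capacity=4):
        # A bounded deque drops the oldest entry when a new one is
        # appended to a full table, mirroring "the oldest entry is overwritten".
        self.entries = deque(maxlen=capacity)

    def insert(self, entry):
        self.entries.append(entry)

    def lookup(self, thread_id, store_type, addr):
        """Return the ITAG of the most recent valid matching entry, or None."""
        for entry in reversed(self.entries):
            if entry.valid and entry.key() == (thread_id, store_type, addr):
                return entry.itag
        return None

    def invalidate(self, itag):
        """Clear the valid bit when the store completes or is flushed."""
        for entry in self.entries:
            if entry.itag == itag:
                entry.valid = False
```

A hardware CAM would perform the comparison against all valid entries in parallel; the linear scan above only models the result of that comparison.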

It should be noted that the above-described matching does not generally detect all LHS hazards, since, for example, a store instruction using immediate addressing may hit the same address as a load instruction using register or register indirect addressing, and a matching entry in LHS detection table 41 will not be found for the load. Such an LHS hazard will instead be rejected during the issue phase after the full EA has been computed for both the load and store instructions. However, most likely LHS hazards should be detected under normal circumstances and the number of load rejects due to LHS hazards dramatically reduced. Further, an entry may be found in LHS detection table 41 that is flagged as an LHS hazard and in actuality is not, for example, when a base register value has been modified between a register-addressed load and a preceding register-addressed store using the same base register pair. Therefore, the method detects likely LHS hazards and not guaranteed address conflicts/overlaps. However, such occurrences should be rare compared to the number of actual LHS hazards detected.
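The two kinds of misprediction noted above may be illustrated with a small sketch. It is a simplification under stated assumptions: the effective address for register-based addressing is taken to be the sum of the base register values, the effective address for immediate addressing is taken to be the immediate value alone, and the helper names `predicted_hazard` and `effective_address` are hypothetical.

```python
def predicted_hazard(store, load):
    """Dispatch-time prediction: same addressing type and same address fields."""
    return store["type"] == load["type"] and store["fields"] == load["fields"]


def effective_address(op, regs):
    """Assumed EA model: immediate value, or sum of base register values."""
    if op["type"] == "imm":
        return op["fields"]
    return sum(regs[r] for r in op["fields"])


regs = {1: 0x100, 2: 0x20}

# Missed hazard: different addressing types, identical effective address.
store_a = {"type": "imm", "fields": 0x120}
load_a = {"type": "reg", "fields": (1, 2)}
missed = (not predicted_hazard(store_a, load_a)
          and effective_address(store_a, regs) == effective_address(load_a, regs))

# False hazard: same base register pair, but register 2 is modified
# between the store and the load, so the addresses differ.
store_b = {"type": "reg", "fields": (1, 2)}
load_b = {"type": "reg", "fields": (1, 2)}
ea_store = effective_address(store_b, regs)
regs_after = dict(regs)
regs_after[2] = 0x40
ea_load = effective_address(load_b, regs_after)
false_hazard = predicted_hazard(store_b, load_b) and ea_store != ea_load
```

In the first case the load would still be rejected at issue once both effective addresses are known. In the second case the load is merely held behind a store that it does not actually depend on.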

Referring now to FIG. 5, a method of operation of processor core 20 in accordance with an embodiment of the present invention is illustrated in a flowchart. As illustrated in FIG. 5, when IFU 22 fetches an instruction (step 60) and the instruction is decoded (step 61), if the instruction is a store instruction (decision 62), and if there is an existing entry in LHS detection table 41 that matches the base registers (register-based addressing) and/or immediate field (immediate addressing) of the store instruction (decision 63), the existing entry is invalidated, or alternatively over-written (step 64). The base registers and immediate field of the store instruction are written to an entry in LHS detection table 41 (step 65) and the store instruction is dispatched (step 66). If the instruction is not a store instruction (decision 62), but is a load instruction (decision 67), if the base registers (register-based addressing) or immediate field (immediate addressing) match an entry in LHS detection table 41 (decision 68), the load instruction is dispatched to issue queue 27D with an ITAG of the store instruction corresponding to the table entry (step 70). Otherwise, the load instruction is dispatched without an ITAG (step 69), as are instructions that are neither load nor store instructions. Until the system is shut down (decision 71), steps 60-70 are repeated.
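The dispatch flow of FIG. 5 may be sketched, for illustration only, as a single function applied to each decoded instruction. The dictionary-based table, the instruction record layout and the function name `dispatch` are assumptions made for this sketch. The dictionary keeps one entry per address key, which corresponds to the over-write alternative of step 64.

```python
def dispatch(instr, table, issue_queue):
    """Dispatch one decoded instruction (hypothetical record layout).

    table maps (addr_type, addr) -> ITAG of the pending store.
    issue_queue receives (itag, kind, dependency_itag) tuples.
    """
    kind = instr["kind"]
    if kind == "store":                                   # decision 62
        key = (instr["addr_type"], instr["addr"])
        # Decision 63 / steps 64-65: any existing matching entry is
        # over-written by the entry for the newer store.
        table[key] = instr["itag"]
        issue_queue.append((instr["itag"], "store", None))  # step 66
    elif kind == "load":                                  # decision 67
        key = (instr["addr_type"], instr["addr"])
        dep = table.get(key)                              # decision 68
        # Step 70 when dep is an ITAG, step 69 when dep is None.
        issue_queue.append((instr["itag"], "load", dep))
    else:
        # Neither load nor store: dispatched without an ITAG dependency.
        issue_queue.append((instr["itag"], kind, None))


table, queue = {}, []
stream = [
    {"kind": "store", "itag": 10, "addr_type": "reg", "addr": (3, 4)},
    {"kind": "add", "itag": 11},
    {"kind": "load", "itag": 12, "addr_type": "reg", "addr": (3, 4)},
    {"kind": "load", "itag": 13, "addr_type": "imm", "addr": 8},
]
for instr in stream:
    dispatch(instr, table, queue)
```

After the loop, the load with ITAG 12 carries a dependency on store ITAG 10, while the load with ITAG 13 is dispatched untagged because no pending store shares its address information.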

Referring now to FIG. 6, another method of operation of processor core 20 in accordance with an embodiment of the present invention is illustrated in a flowchart. As illustrated in FIG. 6, LSU 30 peeks issue queue 27D (step 80) and if the next instruction has an existing dependency (decision 81), such as a dependency generated by the method of FIG. 5 when the load instruction is dispatched with an ITAG of a store for which an LHS hazard is predicted, the peek moves to the next instruction (step 82). If the next instruction does not have an existing dependency (decision 81), if the instruction is a load instruction (decision 83), and LDQ 43 is not full (decision 84), the load instruction is issued to LDQ 43 (step 87). Similarly, if the instruction is a store instruction (decision 85), and STQ 42 is not full (decision 86), the store instruction is issued to STQ 42 (step 87). Until the system is shut down (decision 88), steps 80-87 are repeated.
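The issue flow of FIG. 6 may likewise be sketched as one pass over the issue queue. This is a minimal model under stated assumptions: a dependency is treated as resolved once the store with the matching ITAG has been issued, queue entries use the `(itag, kind, dependency_itag)` layout, and the function name `issue_pass` and the capacity parameters are hypothetical.

```python
def issue_pass(issue_queue, ldq, stq, ldq_cap, stq_cap, issued_stores):
    """Make one pass over the issue queue; return the entries left waiting."""
    remaining = []
    for itag, kind, dep in issue_queue:                   # step 80: peek
        if dep is not None and dep not in issued_stores:  # decision 81
            remaining.append((itag, kind, dep))           # step 82: skip
        elif kind == "load" and len(ldq) < ldq_cap:       # decisions 83, 84
            ldq.append(itag)                              # step 87
        elif kind == "store" and len(stq) < stq_cap:      # decisions 85, 86
            stq.append(itag)                              # step 87
            issued_stores.add(itag)
        else:
            remaining.append((itag, kind, dep))           # target queue full
    return remaining


ldq, stq, issued = [], [], set()
queue = [(2, "load", 1), (3, "load", None), (1, "store", None)]
queue = issue_pass(queue, ldq, stq, 2, 2, issued)   # load 2 waits on store 1
queue = issue_pass(queue, ldq, stq, 2, 2, issued)   # dependency now resolved
```

The tagged load is held in the issue queue during the first pass rather than being issued and rejected, which is the effect the method is intended to achieve.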

While the invention has been particularly shown and described with reference to the preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form, and details may be made therein without departing from the spirit and scope of the invention.

What is claimed is:
1. A method of operation of a processor core, the method comprising: fetching instructions of an instruction stream; dispatching instructions of the instruction stream by an instruction dispatch unit of the processor core that dispatches the instructions to issue queues, according to a type of the instructions; detecting likely load-hit-store hazards prior to the dispatch of load instructions to an issue queue of a load-store unit of the processor core; and identifying the likely load-hit-store hazards to the load-store unit, whereby rejections of the load instructions by the load-store unit due to load-hit-store hazards is reduced.
2. The method of claim 1, wherein the instruction dispatch unit detects store instructions of the instruction stream during the dispatching of the store instructions and stores store address information associated with the store instructions in corresponding entries in a load-hit-store detection table, and wherein the detecting likely load-hit-store hazards comprises detecting load instructions of the instruction stream and comparing the store address information of the entries in the table with load address information of load instructions of the instruction stream.
3. The method of claim 2, further comprising: responsive to the detecting of a store operation, writing the store address information associated with the store operation to the load-hit-store detection table and dispatching the store operation to the issue queue of the load-store unit of the processor core; responsive to the detecting of a load instruction, comparing the load address information of the load instruction to entries in the load-hit-store detection table corresponding to store operations occurring earlier in the instruction stream to determine if a likely load-hit-store hazard exists between the load instruction and a given one of the store operations; responsive to the comparing determining that the likely load-hit-store hazard exists between the load instruction and the given store operation, dispatching the load instruction to the issue queue of the load-store unit of the processor core along with a tag identifying the given store operation; and responsive to the comparing determining that the likely load-hit-store hazard does not exist between the load instruction and the given store operation, dispatching the load instruction to the issue queue of the load-store unit of the processor core without the tag.
4. The method of claim 2, wherein the store address information is one or both of an immediate field of the store instruction and one or more base register numbers of the store instruction.
5. The method of claim 3, further comprising: the load-store unit examining a next entry of the issue queue to determine whether or not a next operation is a load instruction with a corresponding tag; the load-store unit, responsive to determining that the load instruction with a corresponding tag is not present, processing the next entry for execution by the load-store unit; the load-store unit examining the next entry of the issue queue to determine whether or not the next operation is a store operation; the load-store unit, responsive to determining that the next operation is a store operation, examining the issue queue to determine whether a load instruction having a corresponding tag matching a tag of the store operation is present; the load-store unit, responsive to determining that the next operation is a store operation, processing the next entry for execution by the load-store unit; and the load-store unit, responsive to determining that the load instruction having the corresponding tag matching the tag of the store operation is present, processing the load instruction for execution by the load-store unit subsequent to processing the next entry.
6. The method of claim 3, further comprising: responsive to detecting a store operation in the instruction stream, comparing entries in the load-hit-store detection table with the store address information of the store operation; and responsive to the comparing detecting a match between the store address information of the store instruction and an entry in the load-hit-store detection table, invalidating the entry in the load-hit-store detection table prior to the instruction dispatch unit storing an entry corresponding to the store instruction in the load-hit-store detection table, whereby only a single valid entry in the load-hit-store detection table contains identical store address information at any time.
7. The method of claim 3, wherein the comparing compares a most-recently-stored matching entry in the load-hit-store detection table that has a match between the load address information of the load instruction and the most-recently-stored matching entry in the load-hit-store detection table, whereby multiple valid entries in the load-hit-store detection table may match a particular load address information, without causing a load-hit-store hazard.